Distributed Submodular Maximization: Identifying Representative Elements in Massive Data
Many large-scale machine learning problems (such as clustering, non-parametric learning, and kernel machines) require selecting, out of a massive data set, a manageable, representative subset. Such problems can often be reduced to maximizing a submodular set function subject to cardinality constraints. Classical approaches require centralized access to the full data set, but for truly large-scale problems, centralizing the data is often impractical. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple two-stage protocol, GreeDI, that is easily implemented using MapReduce-style computations. We theoretically analyze our approach and show that, under certain natural conditions, performance close to the (impractical) centralized approach can be achieved. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference on tens of millions of examples using Hadoop.
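The two-stage protocol can be sketched in a few lines: partition the data, run the classical greedy on each partition in parallel, then run greedy once more over the union of the partial solutions. Below is a minimal single-process sketch under a toy coverage objective; the names `greedi`, `greedy`, and `cover` are illustrative, not the paper's code.

```python
import random

def greedy(candidates, k, f):
    """Standard greedy: repeatedly add the element with the largest marginal gain."""
    selected = []
    for _ in range(k):
        best = max((e for e in candidates if e not in selected),
                   key=lambda e: f(selected + [e]) - f(selected))
        selected.append(best)
    return selected

def greedi(data, k, m, f):
    """Two-stage sketch: split data across m machines, run greedy on each shard,
    then run greedy again over the union of the m partial solutions."""
    shards = [data[i::m] for i in range(m)]
    partial = [greedy(shard, k, f) for shard in shards]   # stage 1 (parallel in practice)
    merged = [e for sol in partial for e in sol]
    return greedy(merged, k, f)                           # stage 2 (single machine)

# Toy monotone submodular function: coverage of a small universe.
ground_sets = {i: set(random.Random(i).sample(range(50), 8)) for i in range(40)}
cover = lambda S: len(set().union(*(ground_sets[e] for e in S))) if S else 0
solution = greedi(list(ground_sets), k=5, m=4, f=cover)
```

In a real MapReduce deployment, stage 1 runs as the map phase (one greedy per shard) and stage 2 as a single reduce over the merged candidate set.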
Fast Distributed k-Center Clustering with Outliers on Massive Data
Clustering large data is a fundamental problem with a vast number of applications. Due to the increasing size of data, practitioners interested in clustering have turned to distributed computation methods. In this work, we consider the widely used k-center clustering problem and its variant used to handle noisy data, k-center with outliers. In the noise-free setting we demonstrate that a previously proposed distributed method is in fact an O(1)-approximation algorithm, which accurately explains its strong empirical performance. Additionally, in the noisy setting, we develop a novel distributed algorithm that is also an O(1)-approximation. These algorithms are highly parallel and lend themselves to virtually any distributed computing framework. We compare both algorithms empirically against the best known noisy sequential clustering methods and show that both distributed algorithms consistently perform close to their sequential versions. The algorithms are all one can hope for in distributed settings: they are fast, memory-efficient, and they match their sequential counterparts.
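The sequential baseline for noise-free k-center is the classic farthest-first (Gonzalez) greedy, a 2-approximation; distributed variants typically run it on each machine's shard and then again over the union of the local centers. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
import math

def gonzalez_k_center(points, k):
    """Farthest-first traversal: a classic 2-approximation for k-center.
    Returns the chosen centers and the resulting clustering radius."""
    centers = [points[0]]                       # arbitrary first center
    dist = [math.dist(p, centers[0]) for p in points]
    for _ in range(k - 1):
        i = max(range(len(points)), key=lambda j: dist[j])  # farthest point
        centers.append(points[i])
        for j, p in enumerate(points):          # update distance to nearest center
            dist[j] = min(dist[j], math.dist(p, centers[-1]))
    return centers, max(dist)

pts = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
centers, radius = gonzalez_k_center(pts, k=2)
```

On this toy input the algorithm picks (0, 0) and then its farthest point (10, 10), leaving a covering radius of 10.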
Positive region preserved random sampling: an efficient feature selection method for massive data
Bai, Hexiang, Li, Deyu, Liang, Jiye, Zhai, Yanhui
Selecting relevant features is an important and necessary step for intelligent machines to maximize their chances of success. However, intelligent machines generally lack sufficient computing resources when faced with huge volumes of data. This paper develops a new method based on sampling techniques and rough set theory to address the challenge of feature selection for massive data. To this end, the paper proposes measuring the discriminatory ability of a feature set by the ratio of discernible object pairs to all object pairs that should be distinguished. Based on this measure, a new feature selection method is proposed. The method constructs positive-region-preserved samples from the massive data to find a feature subset with high discriminatory ability. Compared with other methods, the proposed method has two advantages. First, it can select a feature subset that preserves the discriminatory ability of all the features of the target massive data set within an acceptable time on a personal computer. Second, a lower bound on the proportion of discernible object pairs, among all object pairs that should be distinguished, achieved by the selected feature subset can be estimated before finding reducts. Furthermore, 11 data sets of different sizes were used to validate the proposed method. The results show that approximate reducts can be found in a very short time, and the discriminatory ability of the final reduct exceeds the estimated lower bound. Experiments on four large-scale data sets also showed that an approximate reduct with high discriminatory ability can be obtained in reasonable time on a personal computer.
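The core measure, the fraction of label-differing object pairs that a chosen feature subset can tell apart, can be sketched directly. This is a toy illustration of the ratio described above; the function and data are mine, not the paper's code.

```python
from itertools import combinations

def discernibility_ratio(rows, labels, feature_idx):
    """Fraction of object pairs with different labels (pairs that should be
    distinguished) that differ on at least one of the chosen features."""
    should, can = 0, 0
    for i, j in combinations(range(len(rows)), 2):
        if labels[i] != labels[j]:              # pair that must be distinguished
            should += 1
            if any(rows[i][f] != rows[j][f] for f in feature_idx):
                can += 1                        # pair is discernible
    return can / should if should else 1.0

# Toy decision table: two binary features, two classes.
rows = [(1, 0), (1, 1), (0, 1), (0, 0)]
labels = ['yes', 'yes', 'no', 'no']
full = discernibility_ratio(rows, labels, [0, 1])   # all features
sub = discernibility_ratio(rows, labels, [1])       # second feature only
```

Here the full feature set separates every cross-class pair (ratio 1.0), while feature 1 alone separates only half of them, so it would not preserve the positive region.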
Distributed Submodular Cover: Succinctly Summarizing Massive Data
How can one find a subset, ideally as small as possible, that well represents a massive dataset? That is, its corresponding utility, measured according to a suitable utility function, should be comparable to that of the whole dataset. Here, the utility is assumed to exhibit submodularity, a natural diminishing-returns condition prevalent in many data summarization applications. The classical greedy algorithm is known to provide solutions with logarithmic approximation guarantees compared to the optimum solution. However, this sequential, centralized approach is impractical for truly large-scale problems. In this work, we develop the first distributed algorithm – DISCOVER – for submodular set cover that is easily implementable using MapReduce-style computations.
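DISCOVER itself is distributed, but the classical sequential greedy it builds on is easy to sketch: keep adding the element with the largest marginal gain until the utility reaches the cover target. The names and toy coverage utility below are illustrative, not the paper's code.

```python
def greedy_submodular_cover(ground, f, target):
    """Greedy for submodular cover: grow S by largest marginal gain until
    f(S) >= target. For integral monotone submodular f this achieves a
    logarithmic approximation to the smallest covering set."""
    S = []
    while f(S) < target:
        best = max((e for e in ground if e not in S),
                   key=lambda e: f(S + [e]) - f(S))
        if f(S + [best]) == f(S):   # no element makes progress: target unreachable
            break
        S.append(best)
    return S

# Toy instance: cover the universe {0..9} with as few sets as possible.
sets = {'A': {0, 1, 2, 3}, 'B': {3, 4, 5}, 'C': {6, 7, 8, 9}, 'D': {0, 9}}
f = lambda S: len(set().union(*(sets[e] for e in S))) if S else 0
cover = greedy_submodular_cover(list(sets), f, target=10)
```

On this instance the greedy picks A, then C, then B, reaching full coverage with three sets; the centralized bottleneck is that each iteration scans the entire ground set, which is what the distributed formulation avoids.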
Datalike: Interview with Angelique Yameogo
Angelique Yameogo is studying for a PhD at the University of South Brittany in France. Her thesis focuses on fake news analysis using data science techniques. She has worked with several companies in Burkina Faso as an artificial intelligence engineer and mobile developer. She is skilled in HTML, CSS, JavaScript, pandas, scikit-learn, NLTK, and other tools.
Cloud Software Engineer 3
A Bachelor's Degree in Computer Science or a related technical field is highly desired and will be considered equivalent to two (2) years of experience. A Master's degree in a technical field will be considered equivalent to four (4) years of experience. A degree in Mathematics, Information Systems, Engineering, or similar will be considered a technical field. Eight (8) years of experience in software development/engineering, including requirements analysis, software development, installation, integration, evaluation, enhancement, maintenance, testing, and problem diagnosis/resolution, and at least six (6) years of experience developing software with high-level languages such as Java, C, and C++. Demonstrated ability to work with open-source (NoSQL) products that support highly distributed, massively parallel computation needs, such as HBase, Accumulo, and Bigtable.
Peraton drives missions of consequence spanning the globe and extending to the farthest reaches of the galaxy. As the world's leading mission capability integrator and transformative enterprise IT provider, we deliver trusted and highly differentiated national security solutions and technologies that keep people safe and secure. Peraton serves as a valued partner to essential government agencies across the intelligence, space, cyber, defense, civilian, health, and state and local markets. Every day, our employees do the can't be done, solving the most daunting challenges facing our customers. For Colorado residents: Colorado salary minimum: $90,500; Colorado salary maximum: $219,700. The estimate displayed represents the typical salary range for this position and is just one component of Peraton's total compensation package for employees. Other rewards may include annual bonuses, short- and long-term incentives, and program-specific awards. In addition, Peraton provides a variety of benefits to employees.
To evolve, AI must face its limitations
From medical imaging and language translation to facial recognition and self-driving cars, examples of artificial intelligence (AI) are everywhere. And let's face it: although not perfect, AI's capabilities are pretty impressive. Even something as seemingly simple and routine as a Google search represents one of AI's most successful examples, capable of searching vastly more information at a vastly greater rate than humanly possible and consistently providing results that are (at least most of the time) exactly what you were looking for. The problem with all of these AI examples, though, is that the artificial intelligence on display is not really all that intelligent. While today's AI can do some extraordinary things, the functionality underlying its accomplishments works by analyzing massive data sets and looking for patterns and correlations without understanding the data it is processing. As a result, an AI system relying on today's AI algorithms and requiring thousands of tagged samples only gives the appearance of intelligence.
CVAT Annotation
Structuring and training a machine learning model is not as easy as it may sound. Without the required data, it is difficult to achieve accurate results. Machine learning algorithms sit at the core of many AI programs that perform complex computations, enabling the systematic execution of learning tasks. The quality of the data is central to an algorithm, but how that data is applied at each stage also determines the accuracy of predictions. Whether data is limited or available in ample amounts, manual data annotation is not a practical solution when business demands are changing rapidly.
What are the biggest challenges in artificial intelligence, and how can we solve them?
Artificial intelligence (AI) is set to change how the world works. Although it's not perfect, artificial intelligence is a game changer. AI is the main engine of the digital revolution. The COVID-19 crisis has accelerated the need for human-machine digital intelligent platforms facilitating new knowledge, competences, and workforce skills: advanced cognitive, scientific, technological and engineering, and social and emotional skills. In the AI and robotics era, there is high demand for scientific knowledge, digital competence, and high-technology training in a range of innovative areas of exponential technologies, such as artificial intelligence, machine learning and robotics, data science and big data, cloud and edge computing, the Internet of Things, 5G, cybersecurity, and digital reality.